Resilience for Multigrid Software at the Extreme Scale

Authors

Markus Huber

Björn Gmeiner

Ulrich Rüde

Barbara I. Wohlmuth

Abstract

Fault-tolerant algorithms for the numerical approximation of elliptic partial differential equations on modern supercomputers play an increasingly important role in the design of future exa-scale enabled iterative solvers. Here, we combine domain partitioning with highly scalable geometric multigrid schemes to obtain fast and fault-robust solvers in three dimensions. The recovery strategy is based on a hierarchical hybrid concept in which the values on lower-dimensional primitives such as faces are stored redundantly and can thus be recovered easily after a failure. The lost volume unknowns in the faulty region are re-computed approximately with multigrid cycles by solving a local Dirichlet problem on the faulty subdomain. Different strategies are compared and evaluated with respect to performance, computational cost, and speedup. Especially effective are strategies in which the local recovery in the faulty region is executed in parallel with global solves and is additionally accelerated. This results in an asynchronous multigrid iteration that can fully compensate for faults. Excellent parallel performance on a current peta-scale system is demonstrated.
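The recovery idea in the abstract (keep the interface values, then re-compute the lost interior unknowns by solving a local Dirichlet problem) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a 1D Poisson model problem instead of the 3D setting, uses plain Gauss-Seidel sweeps as a stand-in for multigrid cycles, and all function names (`gauss_seidel`, `recover_faulty_region`, `run_demo`) are hypothetical.

```python
def gauss_seidel(u, f, h, lo, hi, sweeps):
    """Gauss-Seidel sweeps for -u'' = f on unknowns lo..hi-1.

    u[lo-1] and u[hi] are never modified, so they act as Dirichlet data.
    """
    for _ in range(sweeps):
        for i in range(lo, hi):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])


def max_error(u, exact):
    """Maximum pointwise difference between an iterate and the exact solution."""
    return max(abs(a - b) for a, b in zip(u, exact))


def recover_faulty_region(u, f, h, lo, hi, sweeps=500):
    """Re-compute the lost unknowns u[lo:hi] from the surviving interface values.

    Assumption: u[lo-1] and u[hi] play the role of the redundantly stored
    interface data; Gauss-Seidel replaces the multigrid cycles of the paper.
    """
    for i in range(lo, hi):
        u[i] = 0.0  # lost values restart from a zero initial guess
    gauss_seidel(u, f, h, lo, hi, sweeps)


def run_demo(n=32, global_sweeps=200, lo=12, hi=20):
    h = 1.0 / n
    f = [1.0] * (n + 1)
    # For f = 1 the discrete solution is the quadratic x(1 - x)/2.
    exact = [0.5 * (i * h) * (1.0 - i * h) for i in range(n + 1)]

    u = [0.0] * (n + 1)
    gauss_seidel(u, f, h, 1, n, global_sweeps)  # partially converged global solve
    err_before = max_error(u, exact)

    for i in range(lo, hi):                     # simulated fault: interior values lost
        u[i] = 0.0
    err_lost = max_error(u, exact)

    recover_faulty_region(u, f, h, lo, hi)      # local Dirichlet solve
    err_after = max_error(u, exact)
    return err_before, err_lost, err_after


if __name__ == "__main__":
    before, lost, after = run_demo()
    print(f"before fault: {before:.3e}  after fault: {lost:.3e}  recovered: {after:.3e}")
```

In this toy setting the recovered values are only as accurate as the interface data they are computed from, which matches the abstract's wording that the lost unknowns are re-computed approximately; the global iteration then continues from the repaired iterate.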

Similar resources

Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale

Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Projections based on the current generation of HPC systems and technology roadmaps suggest the prevalence of very high fault rates in future systems. While the HPC community has developed various resilience solutions, application-level techniques as well as system-based solutions, the solution sp...

Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.0)

Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Projections based on the current generation of HPC systems and technology roadmaps suggest very high fault rates in future systems. The errors resulting from these faults will propagate and generate various kinds of failures, which may result in outcomes ranging from result corruptions to ca...

The Need for Resilience Research in Workflows of Big Compute and Big Data Scientific Applications

Projections and reports about exascale failure modes conclude that we need to protect numerical simulations and data analytics from an increasing risk of hardware and software failures and silent data corruptions (SDC) [1, 4]. At this scale, hardware and software failures could be as frequent as ten or more per day. According to [9], the semiconductor industry will have increased difficulty pre...

Inter-Agency Workshop on HPC Resilience at Extreme Scale

The following report summarizes the proceedings of a three-and-a-half day inter-agency workshop focused on the technical challenges of HPC resilience in the 2020 Exascale timeframe. The resilience problem is not specific to any particular program or agency; coordinated resilience solutions will be challenging because of the need for a truly integrated approach. The interagency workshop therefor...

A quantitative performance analysis for Stokes solvers at the extreme scale

This article presents a systematic quantitative performance analysis for large finite element computations on extreme scale computing systems. Three parallel iterative solvers for the Stokes system, discretized by low order tetrahedral elements, are compared with respect to their numerical efficiency and their scalability running on up to 786 432 parallel threads. A genuine multigrid method for...

Journal title:

CoRR

Volume: abs/1506.06185

Publication date: 2015
